The dataset is about red wine quality. It contains many samples of differing quality, each with attributes that I believe affect the quality of the wine.
Below I show a histogram of each variable and describe its distribution beneath each graph.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The histogram of volatile acidity shows a distribution that looks somewhat normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar looks lognormal after we decrease the bin width and put the x-axis on a log scale.
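As a rough illustration of the two adjustments described above (a smaller bin width and a log-scaled axis), here is a minimal sketch. The data frame `wine_demo` is a synthetic stand-in, not the real wine data, and the bin width of 0.05 is an illustrative guess rather than the value used for the plot in this report. The ggplot2 part only runs if that package is installed.

```r
# Synthetic stand-in for the residual.sugar column (NOT the real wine data).
set.seed(42)
wine_demo <- data.frame(
  residual.sugar = rlnorm(500, meanlog = log(2.2), sdlog = 0.4)
)

# Assumed bin width, expressed in log10 units.
bin_width <- 0.05

# Base R: bin the log10-transformed values with a fixed bin width.
log_sugar <- log10(wine_demo$residual.sugar)
breaks <- seq(min(log_sugar) - bin_width,
              max(log_sugar) + bin_width,
              by = bin_width)
log_hist <- hist(log_sugar, breaks = breaks, plot = FALSE)

# ggplot2 equivalent: with scale_x_log10(), binwidth is applied
# to the transformed (log10) values.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  sugar_plot <- ggplot(wine_demo, aes(x = residual.sugar)) +
    geom_histogram(binwidth = bin_width) +
    scale_x_log10()
}
```

With the real data, `wine_demo` would be replaced by `wine`; if the log-scaled histogram looks roughly bell-shaped, that supports the lognormal reading.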
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Distribution of free sulfur dioxide looks lognormal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The distribution of density looks normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Distribution of sulphates looks lognormal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity looks roughly normal with a right skew, so it could also be read as lognormal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of chlorides looks normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The distribution of total sulfur dioxide looks lognormal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Distribution of alcohol content looks lognormal.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The distribution of pH looks normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
There are 13 columns in the dataset. X is the sample number and quality is our main output; the rest are attributes. There are 1599 samples in the dataset.

### What is/are the main feature(s) of interest in your dataset?

Quality of wine. I would like to predict the quality of a wine from the attributes measured in the sample.

### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think alcohol, density, and total sulfur dioxide are attributes that may affect the quality of wine.

### Did you create any new variables from existing variables in the dataset?

No.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

For the most part, I only had to adjust the bin width to see what the data look like (normal vs. lognormal). However, there are some strange distributions, such as those of citric acid and alcohol.
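The normal-versus-lognormal reading above was made by eye. One way to back it with a number is to compare skewness before and after a log transform; this is a sketch under assumptions, not part of the original analysis. The helpers `skewness` and `compare_skew` are hypothetical names defined here, and the input is synthetic data, not the wine measurements.

```r
# Moment-based sample skewness: about 0 for symmetric data,
# positive when the distribution has a long right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Skewness of a strictly positive variable, raw and after log10.
# Columns containing zeros (citric.acid has a minimum of 0) would
# need an offset before taking the log.
compare_skew <- function(x) {
  c(raw = skewness(x), log10 = skewness(log10(x)))
}

# Synthetic right-skewed stand-in (NOT the real wine data).
set.seed(7)
demo <- rlnorm(1000, meanlog = log(38), sdlog = 0.6)
skew_result <- compare_skew(demo)
```

A variable whose raw skewness is clearly positive but whose log-scale skewness is near zero is a reasonable candidate for "lognormal"; one whose raw skewness is already near zero is closer to normal.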
Below I will be plotting scatter plots of variables against quality.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
We notice that for moderate quality wine, fixed acidity ranges between 5 and 11.
ggplot(aes(x=volatile.acidity,y=quality),data=wine)+geom_jitter()
Here, volatile acidity ranges between 0.35 and 0.8 for moderate quality wine.
ggplot(aes(x=citric.acid,y=quality),data=wine)+geom_jitter()
Here, citric acid ranges between 0 and 0.5 for medium quality wine.
ggplot(aes(x=residual.sugar,y=quality),data=wine)+geom_jitter()
We notice that for moderate quality wine, residual sugar sits at around 2.
ggplot(aes(x=chlorides,y=quality),data=wine)+geom_jitter()
At most quality levels, chlorides sit at around 0.1.
ggplot(aes(x=free.sulfur.dioxide,y=quality),data=wine)+geom_jitter()
Free sulfur dioxide varies a lot: values range from 0 to 40 for medium quality wines. But we notice that as wine quality improves, free sulfur dioxide decreases.
ggplot(aes(x=total.sulfur.dioxide,y=quality),data=wine)+geom_jitter()
We can conclude from this graph and the one above that as wine quality improves, sulfur dioxide decreases.
ggplot(aes(x=density,y=quality),data=wine)+geom_jitter()
Not much can be said from this graph, other than that density tends to decrease slightly as wine quality improves.
ggplot(aes(x=pH,y=quality),data=wine)+geom_jitter()
pH tends to vary less as wine quality improves and moves towards 3.3.
ggplot(aes(x=sulphates,y=quality),data=wine)+geom_jitter()
Sulphate levels increase as wine quality improves.
ggplot(aes(x=alcohol,y=quality),data=wine)+geom_jitter()
As wine quality improves, alcohol level increases.
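The trends read off the jittered scatter plots above can be checked by summarising each attribute per quality score. Below is a minimal sketch of that idea; `median_by_quality` is a hypothetical helper, and `wine_demo` is synthetic data built to have an upward trend, not the real wine data.

```r
# Median of one attribute at each quality score.
median_by_quality <- function(data, attribute) {
  out <- aggregate(data[[attribute]],
                   by = list(quality = data$quality),
                   FUN = median)
  names(out)[2] <- paste0("median_", attribute)
  out
}

# Synthetic stand-in (NOT the real wine data): alcohol rises with quality.
set.seed(3)
demo_quality <- sample(3:8, 300, replace = TRUE)
wine_demo <- data.frame(
  quality = demo_quality,
  alcohol = 8 + 0.5 * demo_quality + rnorm(300, sd = 0.3)
)

alcohol_medians <- median_by_quality(wine_demo, "alcohol")
```

On the real data the call would be `median_by_quality(wine, "alcohol")`, and likewise for the other attributes. Medians that rise or fall steadily across quality scores support a trend; medians that stay flat suggest the pattern seen in the plot is mostly noise.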
cor.test(wine$quality,wine$fixed.acidity,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
cor.test(wine$quality,wine$volatile.acidity,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
cor.test(wine$quality,wine$citric.acid,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
cor.test(wine$quality,wine$residual.sugar,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
cor.test(wine$quality,wine$chlorides,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
cor.test(wine$quality,wine$free.sulfur.dioxide,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
cor.test(wine$quality,wine$total.sulfur.dioxide,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
cor.test(wine$quality,wine$density,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
cor.test(wine$quality,wine$pH,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
cor.test(wine$quality,wine$sulphates,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
cor.test(wine$quality,wine$alcohol,method="pearson")
##
## Pearson's product-moment correlation
##
## data: wine$quality and wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
We see that some variables correlate with quality, but others barely correlate with it at all, such as pH (r = -0.058), free sulfur dioxide (r = -0.051), and residual sugar (r = 0.014). We can leave those variables out of the equation.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I didn’t explore relationships among the other variables.

### What was the strongest relationship you found?

Alcohol looks like the strongest predictor of wine quality (r = 0.476).
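The individual `cor.test()` calls above can be condensed into a single ranking. The sketch below assumes a data frame with a `quality` column and an index column `X`, as in the column listing earlier; `correlations_with_quality` is a hypothetical helper, and `wine_demo` is synthetic data, not the real wine data.

```r
# Pearson correlation of every attribute with the target,
# sorted by absolute value (strongest first).
correlations_with_quality <- function(data, target = "quality", drop = "X") {
  predictors <- setdiff(names(data), c(target, drop))
  r <- sapply(predictors, function(v) cor(data[[v]], data[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Synthetic stand-in (NOT the real wine data).
set.seed(11)
n <- 400
demo_alcohol <- rnorm(n, mean = 10.4, sd = 1)
demo_acidity <- rnorm(n, mean = 0.53, sd = 0.18)
wine_demo <- data.frame(
  X = seq_len(n),
  alcohol = demo_alcohol,
  volatile.acidity = demo_acidity,
  residual.sugar = rlnorm(n, meanlog = log(2.2), sdlog = 0.4),
  quality = round(5.6 + 0.4 * (demo_alcohol - 10.4) -
                    1.5 * (demo_acidity - 0.53) + rnorm(n, sd = 0.6))
)

ranked <- correlations_with_quality(wine_demo)
```

On the real data, `correlations_with_quality(wine)` should reproduce the `cor` estimates printed above in one vector. `cor.test()` is still needed for p-values and confidence intervals.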
This shows that high quality wine correlates with high alcohol and low acidity content. Also, medium quality wine tends to correlate with low levels of alcohol regardless of volatile acidity.
I am investigating the two variables that correlate most strongly with quality: alcohol and volatile acidity. I notice that high quality wine has high alcohol content and low volatile acidity, while medium quality wine has low alcohol and moderate acidity.

### Were there any interesting or surprising interactions between features?

I notice that alcohol is still the strongest predictor. Acidity does affect quality, but if alcohol content is low, quality drops regardless. So acidity doesn’t have as much effect as alcohol, but for a wine to be of high quality, acidity must be low.
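One way to compare the two effects described above is a linear model on standardised variables, so that the coefficients are on the same scale. This is a sketch under assumptions, not part of the original analysis: it treats the ordinal quality score as numeric, which is a simplification, and it uses synthetic data (`wine_demo`) rather than the real wine data.

```r
# Synthetic stand-in (NOT the real wine data).
set.seed(5)
n <- 500
wine_demo <- data.frame(
  alcohol = rnorm(n, mean = 10.4, sd = 1.1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
wine_demo$quality <- 5.6 +
  0.35 * (wine_demo$alcohol - 10.4) -
  1.2 * (wine_demo$volatile.acidity - 0.53) +
  rnorm(n, sd = 0.65)

# Standardise every column (mean 0, sd 1) so coefficients are comparable.
wine_std <- as.data.frame(scale(wine_demo))

# Linear model of quality on the two predictors.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine_std)
std_coefs <- coef(fit)[c("alcohol", "volatile.acidity")]
```

On the real data, replacing `wine_demo` with `wine[, c("alcohol", "volatile.acidity", "quality")]` would give the analogous fit. A larger absolute standardised coefficient for alcohol than for volatile acidity would be consistent with the observation that alcohol is the stronger predictor.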
This shows one of the important predictors of wine quality, alcohol. Its distribution in the sample is lognormal, which indicates that most wines have low alcohol content and are therefore mostly of lower quality, while few wines have high alcohol content (high quality).
I didn’t show a scatter plot, as most of them don’t show much of a relationship. I chose to show the histogram of volatile acidity, which looks like a normal distribution, further indicating that high quality wine is rare.
This shows that high quality wine correlates with high alcohol and low acidity content.
The dataset is problematic in that it has many variables but yields few insights. It all boils down to alcohol and acidity being the most important features; the rest of the variables don’t have much effect on quality. Also, quality takes only a few distinct values. One way to improve the dataset is to collect more samples. In the future, we could also collect prices and see whether price reflects quality or other variables.